ABSTRACT

High density clusters are defined on a population with
density f to be the maximal connected sets of values x with f(x) > c,
for various values of c. It is desired to discover the high density
clusters given a random sample of size N from the population. Using
this clustering model, there is a correspondence between clustering
and density estimation techniques. A hybrid algorithm is proposed which
combines elements of both the k-means and single linkage techniques.
This procedure is practicable for very large numbers of observations,
and is shown to be consistent, under certain regularity conditions, in
one dimension.

KEY  WORDS:   High  density  clusters;   K-Means  clustering;   Single  linkage; 
Density  estimation;   Asymptotic  consistency;   Hybrid  clustering. 


1.   INTRODUCTION 

The high density model for clusters (Hartigan 1975, p. 205)
assumes that observations $x_1, x_2, \ldots, x_N$ are sampled from a population
with density f in p-dimensional space, taken with respect to Lebesgue
measure. High density clusters are maximal connected sets of the
form $\{x \mid f(x) > c\}$, taken for all c. The family $T$ of such clusters
forms a tree, in that $A \in T$, $B \in T$ implies $A \supset B$, $B \supset A$ or
$A \cap B = \emptyset$. A sample hierarchical clustering $T_N$ on
$x_1, x_2, \ldots, x_N$ may now be evaluated by how well it approximates $T$,
on the average. It may be asked whether or not the sample clusters converge
to the population clusters in some sense. The procedure $T_N$ is
set-consistent for $T$ if for each $A, B \in T$ with $A \cap B = \emptyset$,
there exist $A_N, B_N \in T_N$ with

\[ A_N \supset A \cap \{x_1, \ldots, x_N\}, \quad B_N \supset B \cap \{x_1, \ldots, x_N\}, \quad A_N \cap B_N = \emptyset, \]

with probability approaching 1 as $N \to \infty$.
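The population clusters defined above can be illustrated numerically. The following is a minimal sketch, assuming a one-dimensional density evaluated on a fine grid; the helper name `level_set_intervals`, the grid, and the two levels of c are illustrative choices, and the density is the two-component normal mixture used later in Experiment One.

```python
import numpy as np

def level_set_intervals(grid, density, c):
    """Return the maximal runs of grid points with density > c as (left, right) pairs."""
    above = density > c
    intervals = []
    start = None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i                      # a new connected piece begins
        elif not flag and start is not None:
            intervals.append((grid[start], grid[i - 1]))
            start = None                   # the current piece has ended
    if start is not None:
        intervals.append((grid[start], grid[-1]))
    return intervals

def normal_pdf(x, mu):
    """Density of N(mu, 1)."""
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

grid = np.linspace(-4.0, 7.0, 1101)
density = 0.5 * normal_pdf(grid, 0.0) + 0.5 * normal_pdf(grid, 3.0)

low = level_set_intervals(grid, density, 0.05)   # one cluster covering both modes
high = level_set_intervals(grid, density, 0.15)  # one cluster around each mode
```

Each cluster found at the higher level lies inside the single cluster found at the lower level, which is the nesting property that makes the family of clusters a tree.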

Standard  hierarchical  techniques  begin  with  clusters  consisting 
of  single  points,  and  successively  join  pairs  of  clusters  which  are 
closest  according  to  some  measure  of  distance,  to  obtain  new  clusters. 
The  process  terminates  when  a  single  cluster  remains.   Complete  linkage 
(Sorenson  1948)  defines  distance  between  clusters  to  be  the  maximum 
distance  of  pairs  of  points  in  the  clusters,  and  it  is  not  set-consistent 
(Hartigan  1977a).   Average  linkage  (Sneath  and  Sokal  1973)  uses  distance 
between clusters as the average distance between pairs of points in the
two clusters, and sampling experiments (Hartigan 1979) suggest it is not
consistent either. Single linkage (Sneath 1957) defines distance
between clusters as the closest distance between pairs of points in the
two clusters. Single linkage is set-consistent in one dimension but not
in higher dimensions. There is empirical evidence and some theory
(Hartigan 1979) to suggest that single linkage is consistent in a weaker
sense.
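In one dimension single linkage has a particularly simple form: at distance level d, the clusters are the maximal runs of sorted observations whose consecutive gaps do not exceed d. A minimal sketch follows; the function name and the toy data are illustrative, and the input is assumed to be non-empty.

```python
def single_linkage_1d(points, level):
    """Group points into maximal runs whose consecutive sorted gaps are <= level."""
    ordered = sorted(points)
    clusters = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev <= level:
            clusters[-1].append(cur)   # gap small enough: same cluster
        else:
            clusters.append([cur])     # gap too large: start a new cluster
    return clusters

clusters = single_linkage_1d([0.0, 1.0, 2.0, 10.0, 11.0], level=1.5)
```

Increasing `level` can only merge runs, never split them, so the clusterings obtained over all levels form a hierarchy.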

Density estimates generate clusters, namely the high density clusters
corresponding to the estimates. Single linkage corresponds to nearest
neighbour density estimation (Hartigan 1977b), in which the density
estimate at a point x is inversely proportional to the volume of the
smallest closed sphere including a sample point. This density estimate
is inconsistent in the sense that $f_N(x)$ does not approach $f(x)$ in
probability. An improved density estimate, and perhaps improved clustering,
is obtained by the kth nearest neighbour density estimate: the density at
point x is inversely proportional to the volume of the smallest sphere
containing k sample points. Such a density estimate is consistent at a
point x if f is continuous at x and $k \to \infty$ as $N \to \infty$. More
generally, kernel estimates of the form
$\frac{1}{N} \sum_{i=1}^{N} K_N(x, x_i)$ might be used (see, for example,
Wegman 1972).
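The kth nearest neighbour estimate described above can be sketched in one dimension, where the "sphere" around x is an interval of length twice the distance to the kth closest sample point. The normalisation k / (N * volume) is a conventional choice and an assumption here, since the text only states inverse proportionality to the volume; the function name is illustrative.

```python
import numpy as np

def knn_density_1d(x, sample, k):
    """kth nearest neighbour density estimate at x: k / (N * 2 * r_k(x))."""
    sample = np.asarray(sample, dtype=float)
    dists = np.sort(np.abs(sample - x))
    radius = dists[k - 1]              # distance from x to its kth nearest sample point
    return k / (len(sample) * 2.0 * radius)

rng = np.random.default_rng(0)
sample = rng.uniform(0.0, 1.0, size=2000)
estimate = knn_density_1d(0.5, sample, k=100)   # the true density at 0.5 is 1
```

Evaluating this estimate at every sample point requires all pairwise distances, which is the source of the quadratic cost discussed next.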

Although the statistical justification of these density estimates
requires N very large, the number of computations is usually $O(N^2)$, which
begins to be onerous for N over 250. In addition, the actual computation
of high density clusters from the density estimate may be formidable. A
variation of kth nearest neighbour from which clusters may be constructed
is due to Wishart (1969): the density at observation $x_i$ is the kth nearest
neighbour density, points $x_i$ and $x_j$ are connected if $x_j$ is among the
k closest points to $x_i$, or if $x_i$ is among the k closest points to $x_j$,
and the high density sets with this measure of connectedness are clusters.
The computational expense of this technique is $O(N^2)$. For related
techniques, see Ling (1973) and Jardine and Sibson (1971).

A hybrid clustering technique is proposed here which combines the
k-means (Hartigan and Wong 1979) and single linkage clustering techniques.
At the first stage, k-means is used to construct a variable cell histogram
(with k cells) which provides uniformly consistent estimates of the
underlying density (Wong 1980). At the second stage, the high density clusters
corresponding to the computed estimates are obtained by applying single
linkage to an appropriate distance matrix defined on the k cells. A
detailed description of this method is given in Section 2. The number of
calculations is O(Nk). In Section 3, it is shown that the hybrid algorithm
is set-consistent for high density clusters in one dimension, under certain
regularity conditions. Some empirical evidence is given in Section 4 to
show that hybrid clustering is a useful tool for identifying high density
clusters.

2.   THE  HYBRID  ALGORITHM

The  algorithm  consists  of  two  stages:  At  the  first  stage,  the 
observations  are  clumped  into  k  clusters  by  k-means,  so  that  no  movement 
of  an  observation  from  one  cluster  to  another  reduces  the  within  cluster 
sum  of  squares.   A  histogram  estimate  is  then  constructed  on  the  k  regions 
defined  by  the  k-means  partition.   At  the  second  stage,  the  distance 
between  neighbouring  clusters  is  taken  inversely  proportional  to  the 
density  estimate  at  a  point  halfway  between  the  cluster  means,  and  single 
linkage  is  applied  to  the  distance  matrix  to  obtain  the  tree  of  high 
density  clusters  corresponding  to  the  histogram  estimate  of  the  density. 
In one dimension, the algorithm works as follows: a histogram consisting of
k intervals is constructed on the k clusters obtained by k-means; in the
second stage, using the computed density estimates, neighbouring clusters
are then joined successively to give the tree of sample clusters. Since
the  k-means  procedure  provides  a  practicable  and  convenient  way  of 
obtaining  a  k-partition  of  multivariate  data,  the  generalization  to  p 
dimensions  (p  >1)  is  immediate. 

2.1   The  K-means  step 

A  k-means  partition  will  be  taken  to  be  a  partition  into  k  clusters 
such  that  no  observations  can  be  moved  from  one  cluster  to  another  without 
increasing  the  within  cluster  sum  of  squares.   There  are  a  number  of  ways 
of  reaching  such  a  partition  by  transferring  observations  to  reduce  within 
cluster sum of squares (see, for example, Hartigan and Wong 1979); the
number of computations is usually proportional to NkIp, where N is the
number of observations, k is the number of clusters, I is the number of
iterations reallocating all observations, and p is the number of dimensions.
The asymptotic properties of k-means as a clustering technique (as N
approaches infinity with k fixed) have been studied by MacQueen (1967),
Hartigan (1978), and Pollard (1979). In this application, however, it is
used primarily as a density estimation procedure.
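As an illustration of the k-means step in one dimension, here is a minimal sketch. It uses Lloyd-style batch reassignment rather than the single-observation transfer algorithm of Hartigan and Wong (1979), so the partition it reaches need not satisfy the transfer criterion stated above exactly; the function name and the quantile initialisation are illustrative assumptions.

```python
import numpy as np

def kmeans_1d(sample, k, n_iter=100):
    """Lloyd-style k-means on the line; returns sorted data, labels and cluster means."""
    x = np.sort(np.asarray(sample, dtype=float))
    # Initialise the means at evenly spaced sample quantiles (an arbitrary choice).
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        # Assign every observation to its nearest mean: O(N k) work per iteration.
        labels = np.argmin(np.abs(x[:, None] - means[None, :]), axis=1)
        new_means = means.copy()
        for j in range(k):
            members = x[labels == j]
            if members.size > 0:           # an empty cluster keeps its old mean
                new_means[j] = members.mean()
        if np.allclose(new_means, means):
            break
        means = new_means
    labels = np.argmin(np.abs(x[:, None] - means[None, :]), axis=1)
    return x, labels, means

x, labels, means = kmeans_1d([0.0, 0.1, 0.2, 10.0, 10.1, 10.2], k=2)
```

In one dimension each resulting cluster is an interval of consecutive sorted observations, which is what allows the clusters to serve as histogram cells.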

The  asymptotic  properties  of  k-means  as  a  procedure  for  providing  a 
histogram  estimate  of  the  density  are  given  in  Wong  (1980).  In  one  dimension, 
the  following  density  estimate  is  shown  to  be  uniformly  consistent  in 
probability: 
Lemma (Wong 1980, Corollary 7):

Let $x_1, \ldots, x_N$ be a random sample from some population F on [a,b].
Suppose that the density f is four times differentiable and is strictly
positive on [a,b]. Consider a locally optimal k-means partition of the
sample with $k_N$ clusters. Let $n_j$ be the number of observations in the jth
cluster ($j = 1, \ldots, k_N$), and let $\bar{x}_j$ and $WSS_j$ respectively be
the sample mean and within cluster sum of squares of the jth cluster
($j = 1, \ldots, k_N$). Define the pooled density estimate at a point x between
neighbouring cluster means $\bar{x}_{j-1}$ and $\bar{x}_j$ by:

\[
f_N(x) =
\begin{cases}
(n_{j-1} + n_j)^{3/2} / [N (12\,WSS_j^*)^{1/2}], & \bar{x}_{j-1} < x \le \bar{x}_j \quad (j = 2, \ldots, k_N), \\
(n_1 + n_2)^{3/2} / [N (12\,WSS_2^*)^{1/2}], & a \le x \le \bar{x}_1, \\
(n_{k_N-1} + n_{k_N})^{3/2} / [N (12\,WSS_{k_N}^*)^{1/2}], & \bar{x}_{k_N} < x \le b,
\end{cases}
\]

where $WSS_j^* = WSS_j + WSS_{j-1} + \frac{1}{4}(n_j + n_{j-1})(\bar{x}_j - \bar{x}_{j-1})^2$.

Then provided that $k_N = o([N/\log N]^{1/3})$,

\[ \sup_{a \le x \le b} |f_N(x) - f(x)| = o_p(1). \]
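The pooled estimate of the Lemma can be sketched as follows. This assumes the pooled within-cluster term is WSS_j + WSS_{j-1} + (1/4)(n_j + n_{j-1})(mean_j - mean_{j-1})^2 and that no cluster is empty; the function name and the evenly spaced test data are illustrative.

```python
import numpy as np

def pooled_density_1d(x, labels, k):
    """Pooled histogram density estimate on the intervals between adjacent cluster means."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    N = x.size
    n = np.array([(labels == j).sum() for j in range(k)], dtype=float)
    means = np.array([x[labels == j].mean() for j in range(k)])
    wss = np.array([((x[labels == j] - means[j]) ** 2).sum() for j in range(k)])
    order = np.argsort(means)              # put the clusters in left-to-right order
    n, means, wss = n[order], means[order], wss[order]
    pooled_n = n[1:] + n[:-1]
    wss_star = wss[1:] + wss[:-1] + 0.25 * pooled_n * (means[1:] - means[:-1]) ** 2
    # One value per interval (mean_{j-1}, mean_j]; the two end intervals reuse
    # the first and last values respectively.
    return means, pooled_n ** 1.5 / (N * np.sqrt(12.0 * wss_star))

# Evenly spaced points on [0, 1] split into 10 equal cells mimic a uniform density.
x = (np.arange(1000) + 0.5) / 1000
labels = np.arange(1000) // 100
means, f_hat = pooled_density_1d(x, labels, 10)
```

For this idealised input every entry of `f_hat` is very close to 1, the uniform density on [0, 1]. The factor 12 reflects that n points spread evenly over an interval of length L have within sum of squares close to nL^2/12.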


Unfortunately, the univariate results cannot be easily generalized
to the multivariate case. However, let us assume that in many dimensions
($R^p$, p > 1), the ith cluster consists of a regular polytope of volume $v_i$
centred on the cluster mean $\bar{x}_i$. Then $WSS_i \propto n_i v_i^{2/p}$ and
$n_i \propto f(\bar{x}_i) v_i$. It follows that
$f(\bar{x}_i) \propto n_i^{1+p/2} WSS_i^{-p/2}$, and hence
$n_i^{1+p/2} WSS_i^{-p/2}$ can be interpreted as an estimate of $c f(\bar{x}_i)$,
where c is some proportionality constant. And for adjoining clusters i and j,
it is conjectured that a consistent pooled estimate of the density at
$\bar{x}_{ij}$, the midpoint between $\bar{x}_i$ and $\bar{x}_j$, is given by

\[ f_N(\bar{x}_{ij}) \propto (n_i + n_j)^{1+p/2} / [WSS_i + WSS_j + \tfrac{1}{4}(n_i + n_j)\, d^2(\bar{x}_i, \bar{x}_j)]^{p/2}, \tag{2.1} \]

where d is the Euclidean distance. (Note that when p = 1, $f_N(\bar{x}_{ij})$ is
the estimate given in the Lemma.) The assumptions are plausible in two
dimensions, as k-means clusters are likely to be regular hexagons when k is
large, but in three or more dimensions, it is not clear that the best
partition is into regular polytopes. Here, the within cluster sums of
squares are being used to measure the volume of the clusters, which is
acceptable if all clusters are approximately the same shape. The volumes
could be computed directly but at great computational expense in many
dimensions. Much work has yet to be done to prove the conjecture for two
or more dimensions.

2.2   The  Single  linkage  step

The k-means step produces k clusters with cluster means
$\bar{x}_1, \ldots, \bar{x}_k$. The single linkage step constructs hierarchical
high density clusters on the k clusters using the density estimates $f_N$
obtained in the k-means step. The following property of single linkage
recommends its use in this cluster-formation stage of the algorithm: At a
given distance level D*, any two objects in the same single linkage cluster
can be connected by a chain of links of objects such that the size of each
link is no greater than D*. Thus, if the distances between connected
clusters are reciprocal to the density estimates $f_N$, every resulting
single linkage cluster corresponds to a maximal connected region of the
form $\{x \mid f_N(x) > c\}$.

Hence, a distance matrix is computed for the k clusters as follows:
Two clusters i and j are said to be connected if $\bar{x}_{ij}$, the midpoint
between $\bar{x}_i$ and $\bar{x}_j$, is closer to $\bar{x}_i$ (or $\bar{x}_j$)
than to any other cluster mean. If clusters i and j are connected, then
$D(i,j) = 1/f_N(\bar{x}_{ij})$; otherwise $D(i,j) = \infty$. (See (2.1) for the
definition of $f_N$.) Single linkage clusters are then computed from
this distance matrix to give the sample high density clusters.
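The construction of the distance matrix and the subsequent single linkage pass can be sketched as below. This is an illustration under stated assumptions, not the author's program: the pooled term in (2.1) is taken to be WSS_i + WSS_j + (1/4)(n_i + n_j) d^2, the density is used only up to its proportionality constant, and the function names, the union-find merge, and the toy cell summaries are all hypothetical.

```python
import numpy as np

def hybrid_distance_matrix(means, n, wss):
    """D(i, j) = 1 / (pooled density at the midpoint) for connected cells, else infinity."""
    means = np.asarray(means, dtype=float)
    n = np.asarray(n, dtype=float)
    wss = np.asarray(wss, dtype=float)
    k, p = means.shape
    D = np.full((k, k), np.inf)
    for i in range(k):
        for j in range(i + 1, k):
            mid = 0.5 * (means[i] + means[j])
            d_mid = np.linalg.norm(means - mid, axis=1)
            others = np.delete(d_mid, [i, j])
            if others.size and others.min() <= d_mid[i]:
                continue                   # another mean is at least as close: not connected
            d2 = np.sum((means[i] - means[j]) ** 2)
            pooled = wss[i] + wss[j] + 0.25 * (n[i] + n[j]) * d2
            density = (n[i] + n[j]) ** (1 + p / 2) / pooled ** (p / 2)
            D[i, j] = D[j, i] = 1.0 / density
    return D

def single_linkage(D):
    """Merge cells in order of increasing finite distance; returns a list of (i, j, distance)."""
    k = D.shape[0]
    parent = list(range(k))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    edges = sorted((D[i, j], i, j) for i in range(k) for j in range(i + 1, k)
                   if np.isfinite(D[i, j]))
    merges = []
    for dist, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                       # the link joins two different clusters
            parent[ri] = rj
            merges.append((i, j, dist))
    return merges

# Four cells on a line: dense on the left, sparse towards the right.
D = hybrid_distance_matrix([[0.0], [1.0], [2.0], [3.0]], n=[100, 100, 10, 50], wss=[1, 1, 1, 1])
merges = single_linkage(D)
```

Because single linkage depends only on the ordering of the distances, the unknown proportionality constant in the density does not affect the resulting tree. The cells with the highest pooled density between them are joined first.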

Next, we will examine the asymptotic consistency of the hybrid
algorithm for high density clusters.

3.   CONSISTENCY  OF  HYBRID  CLUSTERING
FOR  HIGH  DENSITY  CLUSTERS

Let f denote a density on [a,b] such that $\{x \mid f(x) > c\}$ is the
union of a finite number of closed intervals for every c > 0. Let $T$ be
the tree of population high density clusters defined on f. Let
$x_1, \ldots, x_N$ be a random sample from f and let $T_N$ be the hierarchical
clustering specified by the hybrid algorithm.


Theorem: Suppose that A and B are any two disjoint high density clusters
in $T$. Assume that f is positive and has four bounded derivatives in [a,b].
Then provided that $k_N = o([N/\log N]^{1/3})$, there exist
$A_N, B_N \in T_N$ with

\[ A_N \supset A \cap \{x_1, \ldots, x_N\}, \quad B_N \supset B \cap \{x_1, \ldots, x_N\}, \quad A_N \cap B_N = \emptyset, \]

with probability tending to 1 as $N \to \infty$.

Proof: Since $T_N$ is the tree of high density clusters for $f_N$ (see Lemma),
this theorem is a direct consequence of the Lemma, which states that

\[ \sup_{a \le x \le b} |f_N(x) - f(x)| = o_p(1). \tag{3.1} \]

By definition, for any two disjoint high density clusters A and B in $T$,
there exist $\varepsilon > 0$ and $\lambda > 0$ such that

\[ f(x) > \lambda \quad \text{for all } x \in A \cup B, \tag{3.2} \]

and A and B are separated by a region V, where

\[ f(x) < \lambda - 3\varepsilon \quad \text{for all } x \in V. \tag{3.3} \]

From (3.1), we have

\[ \Pr\Big\{ \sup_{a \le x \le b} |f(x) - f_N(x)| < \varepsilon \Big\} \to 1. \]

Thus, it follows from (3.2) and (3.3) that for N large, with high
probability,

\[ f_N(x) > \lambda - \varepsilon \quad \text{for all } x \in A \cup B, \tag{3.4} \]

and

\[ f_N(x) < \lambda - 2\varepsilon \quad \text{for all } x \in V. \tag{3.5} \]

Since A and B are disjoint, it follows from (3.4) and (3.5) that high
density clusters of the form $\{x \mid f_N(x) \ge \lambda - \varepsilon\}$
separate A and B. The theorem follows.
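The passage from the uniform bound to (3.4) and (3.5) is a triangle-inequality step. Written out on the event that the uniform error is below $\varepsilon$, with $f_N$, $\lambda$, $\varepsilon$, A, B and V as in the proof:

```latex
\begin{align*}
x \in A \cup B:\quad f_N(x) &\ge f(x) - |f_N(x) - f(x)| > \lambda - \varepsilon, \\
x \in V:\quad f_N(x) &\le f(x) + |f_N(x) - f(x)| < (\lambda - 3\varepsilon) + \varepsilon = \lambda - 2\varepsilon.
\end{align*}
```

Every point of A and B therefore lies above the level $\lambda - \varepsilon$ of the estimate, while every point of V lies strictly below it, so no connected set at that level can cross V.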

The above theorem shows that the hybrid algorithm is set-consistent
in one dimension, for densities f on [a,b] which are positive and have
four derivatives, and for k-means partitions into $k_N$ clusters where
$k_N^3 \log N / N \to 0$, $k_N \to \infty$, as $N \to \infty$. We conjecture a
similar result will hold in two dimensions. The higher dimensional case
requires further study; empirical results suggest that the hybrid algorithm
is useful for identifying high density clusters.

4.   EMPIRICAL  VALIDATION  OF  THE  HYBRID  ALGORITHM

The  hybrid  method  was  applied  to  various  generated  data  sets  to 
test  for  its  effectiveness  in  specifying  high  density  clusters  (Wong  1979) . 
Results  of  three  of  the  experiments,  one  using  univariate  data  and  the 
other  two  bivariate,  are  reported  here. 

1. Experiment One: In this experiment, 1000 observations are drawn from
the univariate normal mixture ½N(0,1) + ½N(3,1). This data set is
useful in showing the performance of the hybrid algorithm when two high
density clusters are separated by a region of moderate density. The
density estimates over the intervals between the k=40 cluster means are
plotted in Figure A. (A rough rule of thumb for k is $7(N/\log N)^{1/3}$.)
Although the minimum density between the modes is more than half the
density at the modes, the hybrid algorithm would still produce a
hierarchical clustering which clearly indicates the presence of two modal
regions (see Figure B).
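The two numerical remarks in Experiment One can be checked with a few lines of arithmetic. The sketch below assumes the rule of thumb is 7(N/log N)^(1/3) with a natural logarithm (the base is not stated in the text) and approximates the modes of the mixture by the component means 0 and 3; the function names are illustrative.

```python
import math

def rule_of_thumb_k(n_obs, multiplier=7.0):
    """Suggested number of k-means cells: multiplier * (N / ln N)^(1/3)."""
    return multiplier * (n_obs / math.log(n_obs)) ** (1.0 / 3.0)

def mixture_density(x):
    """Density of the mixture 0.5 * N(0, 1) + 0.5 * N(3, 1)."""
    def phi(z):
        return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return 0.5 * phi(x) + 0.5 * phi(x - 3.0)

k_suggested = rule_of_thumb_k(1000)                      # about 37, close to the k = 40 used
ratio = mixture_density(1.5) / mixture_density(0.0)      # valley density relative to a mode
```

Under these assumptions the ratio is between one half and one, in line with the statement that the density in the valley exceeds half the density at the modes.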

2. Experiment Two: Here, a sample of size 1000 is taken from the bivariate
normal mixture
$\frac{1}{2} BVN\left[(0,0), \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\right] + \frac{1}{2} BVN\left[(3,3), \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\right]$.
There are two widely separated spherical clusters in this data set. The density
estimates ($f_N(x) \propto n_i^2 WSS_i^{-1}$) over the k=40 clusters obtained
by k-means, and the resulting hybrid clusters, are given respectively in
Figures C and D. As do most other hierarchical clustering algorithms, the
hybrid method correctly identifies the two distinct modes in the population.

3. Experiment Three: In this experiment, the sample of size 1000 from the
mixture ½BVN[(0,0), (...)] + ½BVN[(0,6), (...)] resembles two
elliptical clusters with a moderate amount of noise points between them.
The hybrid algorithm identifies the two clusters correctly (see Figures E
and F), while all of the standard joining techniques like single linkage
and complete linkage fail to do so.

The CPU times consumed on the IBM 370/58 in the three examples are
10.9, 12.6, and 16.8 seconds respectively. Hence, the hybrid algorithm may
be considered as a practicable and consistent method for identifying
density clusters.